 embedding layer


Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Neural Information Processing Systems

In deep neural nets, lower-level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. As an alternative, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large-scale neural networks. We develop two versions of SSE: SSE-Graph, which uses knowledge graphs of embeddings, and SSE-SE, which uses no prior information. We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems to the transformer and BERT in natural language processing. We find that when used along with widely used regularization methods such as weight decay and dropout, our proposed SSE can further reduce overfitting, which often leads to more favorable generalization results.
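
The idea of stochastically transitioning between embeddings can be illustrated with a minimal sketch of the prior-free variant. It assumes that, before each SGD step, every embedding index in a batch is swapped with some small probability for an index drawn uniformly at random. The function name `sse_se_replace` and the parameters `p` and `num_embeddings` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import random

def sse_se_replace(indices, num_embeddings, p, rng):
    """Return a copy of `indices` in which each entry is, with probability p,
    replaced by an index drawn uniformly from range(num_embeddings).

    Assumption: this mimics a prior-free stochastic index swap; the exact
    transition probabilities used in the paper may differ.
    """
    perturbed = []
    for idx in indices:
        if rng.random() < p:
            # Look up a randomly chosen embedding instead of the original one.
            perturbed.append(rng.randrange(num_embeddings))
        else:
            perturbed.append(idx)
    return perturbed

# Hypothetical batch of item ids fed to an embedding layer with 10 rows.
rng = random.Random(0)
batch = [3, 3, 7, 1, 9, 4, 4, 2]
perturbed_batch = sse_se_replace(batch, num_embeddings=10, p=0.1, rng=rng)
```

In a training loop, the perturbed indices would be passed to the embedding lookup in place of the original ones, and the rest of the SGD step would stay unchanged.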


Reviews: Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Neural Information Processing Systems

The paper presents a novel and interesting regularization method, with theoretical analysis and good results. However, I fear its main contributions may be limited to recommendation systems or other fields where knowledge graphs are available or easily constructed, or where, in their absence, it is intuitively reasonable to assume a complete graph. Outside those types of tasks, I did not find compelling the arguments for why other fields or tasks would benefit significantly from such a method, despite the improved results on some NLP tasks. The simpler version of the regularizer, which assumes a complete graph in the absence of a knowledge graph, permutes embedding indices with a constant*U(1,N) probability. Despite its appealing theoretical properties, it also risks introducing a bias of its own. The results on NLP tasks did not show major improvements, and the paper did not explain why this type of regularizer would be beneficial and effective for different NLP tasks.


Reviews: Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Neural Information Processing Systems

The paper proposes to integrate a stochastic relabelling embedding operator within the training of a neural net. The reviewers and the area chair are convinced of the merits of the approach which comes with a theoretical justification (smoothing the Rademacher complexity in the uniform case) and solid comparative empirical evidence. The visualization of the embeddings and their interpretation (in supplementary material and in the rebuttal) are appreciated. The AC hopes that the authors will take into account the suggestions/questions in the reviews, specifically concerning the scope of the approach and its limitations, when writing the camera-ready version of the paper. Another question which comes to mind is whether the knowledge graph (e.g. as learned from a teacher network) can facilitate the training of a student network, e.g.


Textual Data Mining for Financial Fraud Detection: A Deep Learning Approach

arXiv.org Artificial Intelligence

In this report, I present a deep learning approach to a natural language processing (hereafter NLP) binary classification task for analyzing financial-fraud texts. First, I searched regulatory announcements and enforcement bulletins from HKEX news to define fraudulent companies and to extract their MD&A reports, and then organized the sentences from the reports with labels and reporting time. My methodology involved different kinds of neural network models, including Multilayer Perceptrons with Embedding layers, vanilla Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) for the text classification task. By utilizing this diverse set of models, I aim to perform a comprehensive comparison of their accuracy in detecting financial fraud. My results have significant implications for financial fraud detection, as this work contributes to the growing body of research at the intersection of deep learning, NLP, and finance, providing valuable insights for industry practitioners, regulators, and researchers in the pursuit of more robust and effective fraud detection methodologies.


TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition

arXiv.org Artificial Intelligence

High-dimensional token embeddings underpin Large Language Models (LLMs), as they can capture subtle semantic information and significantly enhance the modelling of complex language patterns. However, the associated high dimensionality also introduces a considerable number of model parameters and prohibitively high model storage. To address this issue, this work proposes an approach based on the Tensor-Train Decomposition (TTD), where each token embedding is treated as a Matrix Product State (MPS) that can be efficiently computed in a distributed manner. The experimental results on GPT-2 demonstrate that, through our approach, the embedding layer can be compressed by a factor of up to 38.40, and that at a compression factor of 3.31 the model even performed better than the original GPT-2 model.
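
To make the source of the compression concrete, the sketch below counts the parameters of a single embedding vector stored densely versus in tensor-train form, where core k has shape (r[k], n[k], r[k+1]) and the boundary ranks are 1. The mode sizes and ranks are illustrative assumptions, not the settings reported for GPT-2, and `tt_param_count` is a hypothetical helper name.

```python
from math import prod

def tt_param_count(mode_sizes, ranks):
    """Number of parameters in a tensor-train whose k-th core has shape
    (ranks[k], mode_sizes[k], ranks[k + 1])."""
    if len(ranks) != len(mode_sizes) + 1:
        raise ValueError("need exactly one more rank than mode sizes")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary ranks must be 1")
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(mode_sizes))

# Assumed factorization of a 768-dimensional embedding: 4 * 4 * 4 * 4 * 3.
mode_sizes = [4, 4, 4, 4, 3]
ranks = [1, 2, 2, 2, 2, 1]  # assumed TT ranks, chosen only for illustration

dense_params = prod(mode_sizes)              # entries in the dense vector
tt_params = tt_param_count(mode_sizes, ranks)
compression_factor = dense_params / tt_params
```

Larger ranks give a more faithful approximation at the cost of a smaller compression factor, which is the trade-off behind the two figures quoted in the abstract.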


Build a Text Classification Model with Deep Learning

#artificialintelligence

Description: In this project, we will build a deep learning model to classify text documents into different categories. We will use the popular TensorFlow library to build and train the model. Step 1: First, let's import the necessary libraries and load the text data into a Pandas DataFrame. Step 2: Next, let's split the data into training and test sets and create a tokenizer to convert the text into numerical sequences. Step 3: Now, let's use the tokenizer to convert the text into numerical sequences and pad the sequences to the same length.
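
Steps 2 and 3 can be sketched in plain Python to show what the tokenizer and padding produce. This is a stand-in under stated assumptions, not the TensorFlow API the project uses: the helper names `build_vocab`, `texts_to_sequences`, and `pad_to_length` are hypothetical, tokenization is a simple whitespace split, index 0 is reserved for padding and index 1 for unknown words, and padding is applied at the end of each sequence.

```python
from collections import Counter

PAD_ID = 0  # assumed id for padding positions
OOV_ID = 1  # assumed id for words missing from the vocabulary

def build_vocab(texts, num_words=None):
    """Map the most frequent words to integer ids starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(num_words))}

def texts_to_sequences(texts, vocab):
    """Convert each text into a list of word ids, using OOV_ID for unknown words."""
    return [[vocab.get(word, OOV_ID) for word in text.lower().split()] for text in texts]

def pad_to_length(sequences, maxlen):
    """Truncate or right-pad every sequence so that all have length maxlen."""
    padded = []
    for seq in sequences:
        clipped = seq[:maxlen]
        padded.append(clipped + [PAD_ID] * (maxlen - len(clipped)))
    return padded

# Hypothetical training and test texts.
train_texts = ["the cat sat", "the dog barked loudly"]
test_texts = ["the bird sat"]

vocab = build_vocab(train_texts)
train_ids = pad_to_length(texts_to_sequences(train_texts, vocab), maxlen=4)
test_ids = pad_to_length(texts_to_sequences(test_texts, vocab), maxlen=4)
```

The vocabulary is built from the training texts only, so words that appear only in the test set map to the unknown-word id.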


BASM: A Bottom-up Adaptive Spatiotemporal Model for Online Food Ordering Service

arXiv.org Artificial Intelligence

Online Food Ordering Service (OFOS) is a popular location-based service that helps people order what they want. Compared with traditional e-commerce recommendation systems, users' interests may be diverse under different spatiotemporal contexts, leading to varied spatiotemporal data distributions, which limits the fitting capacity of the model. However, many current works simply mix all samples to train a single set of model parameters, which makes it difficult to capture the diversity in different spatiotemporal contexts. We address this challenge by proposing a Bottom-up Adaptive Spatiotemporal Model (BASM) that adaptively fits the spatiotemporal data distribution, which further improves the fitting capability of the model. Specifically, a spatiotemporal-aware embedding layer performs weight adaptation at field granularity in feature embedding, in order to dynamically perceive spatiotemporal contexts. Meanwhile, we propose a spatiotemporal semantic transformation layer to explicitly convert the concatenated input from raw semantics to spatiotemporal semantics, which can further enhance the semantic representation under different spatiotemporal contexts. Furthermore, we introduce a novel spatiotemporal adaptive bias tower to capture diverse spatiotemporal biases, making spatiotemporal distinctions easier to model. To further verify the effectiveness of BASM, we also propose two new metrics, Time-period-wise AUC (TAUC) and City-wise AUC (CAUC). Extensive offline evaluations on public and industrial datasets are conducted to demonstrate the effectiveness of our proposed model. An online A/B experiment further illustrates the practicality of the model as an online service. The proposed method has now been deployed on Ele.me, a major online food ordering platform in China, serving more than 100 million online users.


Strictly Breadth-First AMR Parsing

arXiv.org Artificial Intelligence

AMR parsing is the task of automatically mapping a sentence to an AMR semantic graph. We focus on the breadth-first strategy for this task, which was proposed recently and achieved better performance than other strategies. However, current models under this strategy only encourage the model to produce the AMR graph in breadth-first order, but cannot guarantee this. To solve this problem, we propose a new architecture that guarantees that the parsing will strictly follow the breadth-first order. In each parsing step, we introduce a focused parent vertex and use this vertex to guide the generation. With the help of this new architecture and some other improvements in the sentence and graph encoder, our model obtains better performance on both the AMR 1.0 and 2.0 datasets.